Classification Models using Tidymodels
Introduction to machine learning
Background
GoalZone is a fitness club chain in Canada offering a range of fitness classes in two capacities. Some classes are always fully booked, yet fully booked classes often have a low attendance rate. GoalZone wants to increase the number of spaces available for classes. They want to do this by predicting whether a member will attend the class or not. If they can predict that a member will not attend, they can make another space available.
Setup
- Let's load all required packages and look at their versions
pacman::p_load("tidyverse",
"tidymodels",
"magrittr",
"here",
"scales",
"glmnet",
"stacks",
"janitor",
"finetune",
"vip",
"data.table",
"DT",
"alluvial",
"extrafont",
"gt",
"gtsummary")
lst <- c("tidyverse",
"tidymodels",
"magrittr",
"here",
"scales",
"glmnet",
"stacks",
"janitor",
"finetune",
"vip",
"data.table",
"DT",
"alluvial",
"extrafont",
"gt",
"gtsummary")
as_tibble(installed.packages()) |>
select(Package, Version) |>
filter(Package %in% lst) |>
datatable(style="bootstrap", class="table-condensed",
options = list(dom = 'tp', scrollX = TRUE))

doParallel::registerDoParallel()

knitr::include_graphics("diction.PNG")

Data preparation
- booking_id : same as description with no missing values
- months_as_member : same as description with no missing values
- weight : had 20 missing values, so I replaced them with the mean (during data preprocessing)
- days_before : some values contained the string “days”, so I removed it to keep numbers only
- day_of_week : days were repeated under different names, so I collapsed them, for example ‘Fri.’ and ‘Fri’, ‘Mon’ and ‘Monday’, then ‘Wed’ and ‘Wednesday’
- Time : same as description with no missing values
- category : replaced an unknown value “-” with “unknown”
- attended : same as description
# read in the data
df<-readr::read_csv("fitness_class_2212.csv")
datatable(head(df,100),style="bootstrap", class="table-condensed",
options = list(dom = 'tp', scrollX = TRUE))

Data processing and preparation
df |>
tabyl(day_of_week)
#> day_of_week n percent
#> Fri 279 0.186000000
#> Fri. 26 0.017333333
#> Mon 218 0.145333333
#> Monday 10 0.006666667
#> Sat 202 0.134666667
#> Sun 213 0.142000000
#> Thu 241 0.160666667
#> Tue 195 0.130000000
#> Wed 81 0.054000000
#> Wednesday 35 0.023333333
df |>
tabyl(category)
#> category n percent
#> - 13 0.008666667
#> Aqua 76 0.050666667
#> Cycling 376 0.250666667
#> HIIT 667 0.444666667
#> Strength 233 0.155333333
#> Yoga 135 0.090000000

- Firstly, the data looks a bit messy
- the day_of_week column should have 7 levels representing the 7 days of the week
- “Wed” and “Wednesday” are one thing, so we collapse them into one level
- “Fri.” and “Fri” are one thing, same as “Mon” and “Monday”
- we have an unknown level in category shown by “-”
- days_before is supposed to be a numeric column, but it shows as character because the values contain strings (we need to fix that as well)
Preprocessing and Exploration of data
Data exploration and analysis is typically an iterative process, in which the data scientist takes a sample of data, and performs the following kinds of task to analyze it and test hypotheses:
Clean data to handle errors, missing values, and other issues.
Apply statistical techniques to better understand the data, and how the sample might be expected to represent the real-world population of data, allowing for random variation.
Visualize data to determine relationships between variables, and in the case of a machine learning project, identify features that are potentially predictive of the label.
Derive new features from existing ones that might better encapsulate relationships within the data.
Revise the hypothesis and repeat the process.
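Before the cleaning steps below, a quick structural skim helps confirm what needs fixing; a minimal sketch using only base R and dplyr (assuming `df` has already been read in as above):

```r
library(dplyr)

# Quick structural skim of the raw data
glimpse(df)                            # column names, types and example values
colSums(is.na(df))                     # count of missing values per column
df |> count(day_of_week, sort = TRUE)  # spot duplicated day labels
```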
df<-df|>
mutate_at(c(5,6,7,8),as.factor) |> # change these columns to factors
mutate(day_of_week =
fct_collapse(day_of_week,Wed=c("Wed","Wednesday"),
Fri=c("Fri.","Fri"), # collapse redundant levels
Mon=c("Mon","Monday")),
category=fct_collapse(category,unknown=c("-")), # add category(unknown)
days_before=readr::parse_number(days_before)) # leave only numbers in the dataset

- Let's look at the data one more time
datatable(head(df,100),style="bootstrap", class="table-condensed",
options = list(dom = 'tp', scrollX = TRUE))

Amazing! Our data is much more tidy-r! 💫
Performing exploratory data analysis
df|>
mutate(weight = replace_na(weight, mean(weight, na.rm = T)),
attended = ifelse(attended==1,"yes","no"))|>
select(attended,time,day_of_week,category,weight,months_as_member) |>
tbl_summary(
by = attended,
statistic = list(
all_continuous() ~ "{mean} ({sd})",
all_categorical() ~ "{n} / {N} ({p}%)"),
label = day_of_week ~ "day of week")|>
add_p(test = all_continuous() ~ "t.test",
pvalue_fun = function(x) style_pvalue(x, digits = 2))|>
modify_header(statistic ~ "**Test Statistic**")|>
bold_labels()|>
modify_fmt_fun(statistic ~ style_sigfig)

| Characteristic | no, N = 1,046¹ | yes, N = 454¹ | Test Statistic | p-value² |
|---|---|---|---|---|
| time | 3.7 | 0.054 | ||
| AM | 781 / 1,046 (75%) | 360 / 454 (79%) | ||
| PM | 265 / 1,046 (25%) | 94 / 454 (21%) | ||
| day of week | 7.0 | 0.33 | ||
| Fri | 211 / 1,046 (20%) | 94 / 454 (21%) | ||
| Mon | 163 / 1,046 (16%) | 65 / 454 (14%) | ||
| Sat | 139 / 1,046 (13%) | 63 / 454 (14%) | ||
| Sun | 142 / 1,046 (14%) | 71 / 454 (16%) | ||
| Thu | 163 / 1,046 (16%) | 78 / 454 (17%) | ||
| Tue | 136 / 1,046 (13%) | 59 / 454 (13%) | ||
| Wed | 92 / 1,046 (8.8%) | 24 / 454 (5.3%) | ||
| category | 0.55 | |||
| unknown | 11 / 1,046 (1.1%) | 2 / 454 (0.4%) | ||
| Aqua | 51 / 1,046 (4.9%) | 25 / 454 (5.5%) | ||
| Cycling | 266 / 1,046 (25%) | 110 / 454 (24%) | ||
| HIIT | 454 / 1,046 (43%) | 213 / 454 (47%) | ||
| Strength | 171 / 1,046 (16%) | 62 / 454 (14%) | ||
| Yoga | 93 / 1,046 (8.9%) | 42 / 454 (9.3%) | ||
| weight | 85 (13) | 77 (10) | 12 | <0.001 |
| months_as_member | 11 (7) | 25 (17) | -16 | <0.001 |
| ¹ n / N (%); Mean (SD) | ||||
| ² Pearson's Chi-squared test; Fisher's exact test; Welch Two Sample t-test | ||||
- though the p-value is slightly above 5%, there is some association between time of day and attendance
- there is a significant difference in mean weight and mean months as member between those who attended and those who did not
df|>
mutate(attended = ifelse(attended==1,"yes","no"))|>
group_by(attended,category,time) |>
summarize(count=n()) |>
mutate(prop=count/sum(count)) |>
ggplot(aes(x = category, y = count,fill=attended))+
geom_bar(stat = "identity")+
tvthemes::scale_fill_avatar()+
facet_wrap(~time,scales="free")+
tvthemes::theme_theLastAirbender(title.font="Slayer",
text.font = "Slayer")+
theme(legend.position = 'right',
axis.text.x = element_text(face="bold",
color="brown",
size=8,
angle=45))

- It's quite clear that most members attended morning sessions
- most people registered for the HIIT and Cycling classes
statistics for weight grouped by attendance status
df|>
mutate(attended = ifelse(attended==1,"yes","no"))|>
group_by(attended)|>
summarize(count=n(),
mean=mean(weight, na.rm = T),
std_dev=sd(weight, na.rm = T),
median=median(weight, na.rm = T),
min=min(weight, na.rm = T),
max=max(weight, na.rm = T))|>
column_to_rownames(var = "attended") |>
as.matrix()
#> count mean std_dev median min max
#> no 1046 85.01258 12.97347 82.89 55.41 170.52
#> yes 454 77.09441 10.35717 74.85 57.83 121.38
# Calculate overall attendance rate
print(paste('total attendance rate (%) is ',round(454/(454+1046)*100,digits=2)))
#> [1] "total attendance rate (%) is 30.27"

Pointer
- it could be intuitive to say that the number of months someone has been a member can affect their attendance (is it practical? 😩😅)
- lets examine the number of months as a member variable
Measures of central tendency
To understand the distribution better, we can examine so-called measures of central tendency; which is a fancy way of describing statistics that represent the “middle” of the data. The goal of this is to try to find a “typical” value. Common ways to define the middle of the data include:
The mean: A simple average based on adding together all of the values in the sample set, and then dividing the total by the number of samples.
The median: The value in the middle of the range of all of the sample values.
The mode: The most commonly occurring value in the sample set.
Let’s calculate these values, along with the minimum and maximum values for comparison, and show them on the histogram.
library(statip)
min_val <- min(df$months_as_member)
max_val <- max(df$months_as_member)
mean_val <- mean(df$months_as_member)
med_val <- median(df$months_as_member)
mod_val <- mfv(df$months_as_member)
# Print the stats
glue::glue(
'Minimum: {format(round(min_val, 2), nsmall = 2)}
Mean: {format(round(mean_val, 2), nsmall = 2)}
Median: {format(round(med_val, 2), nsmall = 2)}
Mode: {format(round(mod_val, 2), nsmall = 2)}
Maximum: {format(round(max_val, 2), nsmall = 2)}'
)
#> Minimum: 1.00
#> Mean: 15.63
#> Median: 12.00
#> Mode: 8.00
#> Maximum: 148.00
id_02_hist <- df |>
ggplot() +
geom_histogram(aes(x = months_as_member),
fill = "firebrick", alpha = 0.66) +
labs(title = "months_as_member distribution") +
theme(plot.title = element_text(hjust = 0.5, size = 14),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank())+
ggthemes::scale_fill_tableau()+
tvthemes::theme_theLastAirbender(title.font="Slayer",text.font = "Slayer")+
geom_vline(xintercept = min_val, color = 'gray33', linetype = "dashed", size = 1.3)+
geom_vline(xintercept = mean_val, color = 'cyan', linetype = "dashed", size = 1.3)+
geom_vline(xintercept = med_val, color = 'red', linetype = "dashed", size = 1.3 )+
geom_vline(xintercept = mod_val, color = 'yellow', linetype = "dashed", size = 1.3 )+
geom_vline(xintercept = max_val, color = 'gray33', linetype = "dashed", size = 1.3 )
id_02_log_hist <- df |>
ggplot() +
geom_histogram(aes(x = months_as_member),
fill = "firebrick", alpha = 0.66) +
labs(title = "months_as_member log distribution") +
scale_x_log10() +
theme(plot.title = element_text(hjust = 0.5, size = 14),
axis.title.x = element_blank(),
axis.title.y = element_blank(),
axis.text.y = element_blank(),
axis.ticks.y = element_blank())+
ggthemes::scale_fill_tableau()+
tvthemes::theme_theLastAirbender(title.font="Slayer",text.font = "Slayer")
gridExtra::grid.arrange(id_02_hist, id_02_log_hist)

- the first plot shows the data without a log transformation; it looks like we have an outlier (we will deal with it later)
- the maximum value looks way above the threshold (guess they have been a long-time member of the fitness club)
- the second plot shows the data after a log transformation
boxplots
library(patchwork)
# Plot a box plot
p1<-df |>
ggplot(aes(x = 1, y = months_as_member))+
geom_boxplot(fill=tvthemes::avatar_pal()(1),alpha = 0.7)+
tvthemes::scale_fill_avatar()+
tvthemes::theme_avatar()+
ggtitle("Data Distribution \n with an outlier")+
xlab("")+
ylab("months as a member")+
theme(plot.title = element_text(hjust = 0.5))
p2<- df |>
filter(months_as_member<125) |>
ggplot(aes(x = 1, y = months_as_member))+
geom_boxplot(fill=tvthemes::rickAndMorty_pal()(1),alpha = 0.7)+
tvthemes::scale_fill_avatar()+
tvthemes::theme_avatar()+
ggtitle("Data Distribution \n without an outlier")+
xlab("")+
ylab("months as a member")+
theme(plot.title = element_text(hjust = 0.5))
p3<- df |>
filter(months_as_member<100) |>
ggplot(aes(x = 1, y = months_as_member))+
geom_boxplot(fill=tvthemes::hilda_pal()(1),alpha = 0.7)+
tvthemes::scale_fill_avatar()+
tvthemes::theme_avatar()+
ggtitle("Data Distribution \n without many outliers")+
xlab("")+
ylab("months as a member")+
theme(plot.title = element_text(hjust = 0.5))
p1+p2+p3

- as seen before, there is indeed an outlier here
- we progressively filter the outliers out in the second and third plots
subset <- df |>
dplyr::mutate(days_before=as.numeric(days_before),
attended = ifelse(attended==1,"yes","no")) |>
dplyr::select(days_before,months_as_member,weight,attended)
# Bring in external file for visualisations
source('functions/visualisations.R')
# Use plot function
plot <- histoplotter(subset, attended,
chart_x_axis_lbl = "attendence status",
chart_y_axis_lbl = 'Measures',
boxplot_color = 'navy',
boxplot_fill = '#89CFF0',
box_fill_transparency = 0.2)
# Add extras to plot
plot +
tvthemes::theme_theLastAirbender() +
tvthemes::scale_color_attackOnTitan()+
theme(legend.position = 'top')

- we have an outlier, as seen in the previous histogram
Alluvial Diagram
tbl_summary <- df |>
dplyr::mutate(days_before=as.numeric(days_before),
attended = ifelse(attended==1,"yes","no")) |>
group_by(attended, day_of_week, category) |>
summarise(N = n()) |>
ungroup() |>
na.omit()
alluvial(tbl_summary[, c(1:4)],
freq=tbl_summary$N, border=NA,
col=ifelse(tbl_summary$attended == "yes", "blue", "gray"),
cex=0.65,
ordering = list(
order(tbl_summary$attended, tbl_summary$day_of_week=="Wed"),
order(tbl_summary$category, tbl_summary$day_of_week=="Wed"),
NULL,
NULL))

Are the classes balanced?
loadfonts(quiet=TRUE)
iv_rates <- df |>
mutate(attended=ifelse(attended==1,"yes","no"),
attended=as.factor(attended)) |>
group_by(attended) |>
summarize(count = n()) |>
mutate(prop = count/sum(count)) |>
ungroup()
plot<-iv_rates |>
ggplot(aes(x=attended, y=prop, fill=attended)) +
geom_col(color="black",width = 0.5)+
theme(legend.position="bottom") +
geom_label(aes(label=scales::percent(prop)), color="white") +
labs(
title = "attendance ratio",
subtitle = "Fitness analysis",
y = "proportion(%)",
x = "attendance",
fill="attendance",
caption="B.Ncube::Data enthusiast") +
scale_y_continuous(labels = scales::percent)+
tvthemes::scale_fill_kimPossible()+
tvthemes::theme_theLastAirbender(title.font="Slayer",
text.font = "Slayer")+
theme(legend.position = 'right')
plot

Against my better judgement, I am moving straight to modelling!
Data Budgeting
In this section, we allocate specific subsets of data for different tasks.
set.seed(2056)
# Load data
raw <- df
# Convert 0s and 1s (outcome) into a factor and split data
bb_split <- raw |>
select(-booking_id) |>
mutate(attended = if_else(attended==1, "yes", "no"),
attended = factor(attended)) |>
initial_split(prop = 0.75)
bb_train <- training(bb_split)
bb_test <- testing(bb_split)

Create a model recipe
train_rcp <- recipes::recipe(attended ~ ., data=bb_train)|>
step_impute_mean(weight) |>
step_scale(all_numeric_predictors()) |>
step_dummy(all_nominal_predictors()) |>
themis::step_smote(attended, over_ratio = 0.97, neighbors = 3)
# Prep and juice the recipe so we can view this as a separate data frame
training_df <- train_rcp|>
prep()|>
juice()
# Class imbalance resolved
after_smote <- training_df %$%
table(attended) |>
prop.table() |>
unclass()
print(after_smote)
#> attended
#> no yes
#> 0.5077419 0.4922581

- we applied SMOTE (Synthetic Minority Oversampling Technique), a nearest-neighbours oversampling method; the proportions above confirm that the binary labels (attended: yes/no) are now roughly balanced
Training the Logistic Regression baseline model
The theory is that if a simple linear classifier does a better job than a more complex algorithm, then stick with good old logistic regression.
lr_mod <- parsnip::logistic_reg()|>
set_engine('glm') |>
set_mode("classification")
Creating the model workflow
lr_wf <-
workflow()|>
add_model(lr_mod)|>
add_recipe(train_rcp)

These are easy to explain:
- I create a workflow for the model
- I add the model that I have initialized in the preceding step
- I add the recipe previously created
Next, I will kick off the training process:
lr_fit <-
lr_wf|>
fit(data=bb_train)

Extracting the fitted data
I want to pull the data fits into a tibble I can explore. This can be done below:
options(scipen=999)
lr_fitted <- lr_fit|>
extract_fit_parsnip()|>
tidy()
lr_fitted
#> # A tibble: 16 x 5
#> term estimate std.error statistic p.value
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) -14.6 457. -0.0321 9.74e- 1
#> 2 months_as_member 1.63 0.119 13.7 8.84e-43
#> 3 weight -0.139 0.0771 -1.80 7.20e- 2
#> 4 days_before -0.669 0.424 -1.58 1.15e- 1
#> 5 day_of_week_Mon -1.07 0.845 -1.27 2.06e- 1
#> 6 day_of_week_Sat 0.477 0.299 1.59 1.12e- 1
#> 7 day_of_week_Sun 1.30 0.467 2.79 5.29e- 3
#> 8 day_of_week_Thu 0.0967 0.295 0.327 7.43e- 1
#> 9 day_of_week_Tue -0.414 0.659 -0.627 5.30e- 1
#> 10 day_of_week_Wed -0.947 0.504 -1.88 6.04e- 2
#> 11 time_PM -0.380 0.164 -2.32 2.06e- 2
#> 12 category_Aqua 15.1 457. 0.0331 9.74e- 1
#> 13 category_Cycling 14.8 457. 0.0324 9.74e- 1
#> 14 category_HIIT 15.0 457. 0.0329 9.74e- 1
#> 15 category_Strength 14.6 457. 0.0320 9.74e- 1
#> 16 category_Yoga 14.7 457. 0.0321 9.74e- 1

I will visualise this via a bar chart to observe my significant features:
lr_fitted_add <- lr_fitted |>
mutate(Significance = ifelse(p.value < 0.05,
"Significant", "Insignificant"))|>
arrange(desc(p.value))
#Create a ggplot object to visualise significance
plot <- lr_fitted_add|>
ggplot(mapping = aes(x=term, y=p.value, fill=Significance)) +
geom_col() +
ggthemes::scale_fill_tableau() +
theme(axis.text.x = element_text(face="bold",
color="#0070BA",
size=8,
angle=90)) +
labs(y="P value",
x="Terms",
title="P value significance chart",
subtitle="significant variables in the model",
caption="Produced by Bongani Ncube")
plotly::ggplotly(plot)

- only months_as_member, day_of_week: Sun and time: PM appear to be significant
Set up model (xgboost model)
xgboost_mod <- boost_tree(trees=tune(),
tree_depth = tune())|>
set_mode('classification')|>
set_engine('xgboost')

Hyperparameter tuning and K-Fold Cross Validation
Here, as stated, I will do an iterative search for the best parameters to pass to my model:
# Set the selected parameters in the grid
boost_grid <- dials::grid_regular(
trees(),
tree_depth(),
levels=5) # 5 levels per parameter, i.e. 25 combinations to try
# Create the resampling method i.e. K Fold Cross Validation
folds <- vfold_cv(bb_train, v = 5)

Create XGBoost workflow
I will now implement the workflow to manage the XGBoost model:
xgboost_wf <- workflow()|>
add_model(xgboost_mod)|>
add_recipe(train_rcp)

Once I have this, I can then iterate through the best combinations of folds and hyperparameters.
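The tuning run that produces `xgboost_fold` (used just below) does not appear above; a minimal sketch of how it could be produced with `tune_grid()`, assuming the `xgboost_wf`, `folds` and `boost_grid` objects defined earlier:

```r
# Fit every hyperparameter combination on every resampling fold,
# recording accuracy and ROC AUC for later model selection.
xgboost_fold <- tune_grid(
  xgboost_wf,
  resamples = folds,
  grid      = boost_grid,
  metrics   = yardstick::metric_set(accuracy, roc_auc)
)
```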
We will now select the best model:
best_model <- xgboost_fold|>
#select_best('accuracy')
select_best('roc_auc')

Visualising the results:
xgboost_fold|>
collect_metrics()|>
mutate(tree_depth = factor(tree_depth))|>
ggplot(aes(trees, mean, color = tree_depth)) +
geom_line(size=1.5, alpha=0.6) +
geom_point(size=2) +
facet_wrap(~ .metric, scales='free', nrow=2) +
scale_x_log10(labels = scales::label_number()) +
scale_color_viridis_d(option='plasma', begin=.9, end =0) +
ggthemes::scale_color_tableau()+
ggthemes::theme_fivethirtyeight()

Finalise the workflow and fit best model
I will now finalise my workflow by selecting the best hyperparameters for the job:
final_wf <-
xgboost_wf|>
finalize_workflow(best_model)
print(final_wf)
#> == Workflow ====================================================================
#> Preprocessor: Recipe
#> Model: boost_tree()
#>
#> -- Preprocessor ----------------------------------------------------------------
#> 4 Recipe Steps
#>
#> * step_impute_mean()
#> * step_scale()
#> * step_dummy()
#> * step_smote()
#>
#> -- Model -----------------------------------------------------------------------
#> Boosted Tree Model Specification (classification)
#>
#> Main Arguments:
#> trees = 1
#> tree_depth = 4
#>
#> Computational engine: xgboost
# Final fit of our fold and hyperparameter combination
final_xgboost_fit <-
final_wf|>
last_fit(bb_split)

Collect metrics for evaluation
The final step is to collect the metrics for evaluation. We will dedicate a separate section to the evaluation of our models:
final_xgboost_fit|>
collect_metrics()
#> # A tibble: 2 x 4
#> .metric .estimator .estimate .config
#> <chr> <chr> <dbl> <chr>
#> 1 accuracy binary 0.797 Preprocessor1_Model1
#> 2 roc_auc  binary         0.849 Preprocessor1_Model1

Next we will look at the workflow fit:
# This extracts the workflow fit
workflow_xgboost_fit <- final_xgboost_fit|>
extract_workflow()
# This extracts the parsnip model
xgboost_model_fit <- final_xgboost_fit|>
extract_fit_parsnip()
xgboost_model_fit|>
vip(geom = "point")Model evaluation
As we are following on from the xgboost model building, we will evaluate this first and then compare to our baseline model:
Use fitted XGBoost model to predict on testing set
The aim here is to check that our predictions match up
with our ground truth labels. The class labels will be
determined by probabilities that are higher than 0.5; however, we are
going to tweak this threshold to only allow a label if the probability
is greater than 0.7:
# Pass our test data through model
testing_fit_class <- predict(workflow_xgboost_fit, bb_test)
testing_fit_probs <- predict(workflow_xgboost_fit, bb_test, type='prob')
# Bind this on to our test data with the label to compare ground truth vs predicted
predictions<- cbind(bb_test,testing_fit_probs, testing_fit_class)|>
dplyr::mutate(xgboost_model_pred=.pred_class,
xgboost_model_prob=.pred_yes)|>
dplyr::select(everything(), -c(.pred_class, .pred_no))|>
dplyr::mutate(xgboost_class_custom = ifelse(xgboost_model_prob >0.7,"yes","no"))|>
dplyr::select(-.pred_yes)

Use fitted logistic regression model to predict on test set
We are now going to take the predictions from the baseline model
we created and append them to the predictions we already have in the
predictions data frame:
testing_lr_fit_probs <- predict(lr_fit, bb_test, type='prob')
testing_lr_fit_class <- predict(lr_fit, bb_test)
predictions<- cbind(predictions,
testing_lr_fit_probs,
testing_lr_fit_class)
predictions <- predictions|>
dplyr::mutate(log_reg_model_pred=.pred_class,
log_reg_model_prob=.pred_yes)|>
dplyr::select(everything(), -c(.pred_class, .pred_no))|>
dplyr::mutate(log_reg_class_custom = ifelse(log_reg_model_prob >0.7,"yes","no"))|>
dplyr::select(-.pred_yes)
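Before collapsing to class labels, a threshold-free check of each model's ranking ability can be useful; a sketch using yardstick's `roc_auc()` on the probability columns built above (column names as assumed from the `predictions` frame; "yes" is the second factor level, hence `event_level`):

```r
library(yardstick)

# Test-set AUC for the XGBoost model
predictions |>
  dplyr::mutate(attended = factor(attended)) |>
  roc_auc(truth = attended, xgboost_model_prob, event_level = "second")

# Test-set AUC for the logistic regression baseline
predictions |>
  dplyr::mutate(attended = factor(attended)) |>
  roc_auc(truth = attended, log_reg_model_prob, event_level = "second")
```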
# Get a head view of the finalised data
head(predictions)
#> months_as_member weight days_before day_of_week time category attended
#> 1 17 79.56 8 Wed PM Strength no
#> 2 10 79.01 2 Mon AM HIIT no
#> 3 5 86.12 10 Fri AM Cycling no
#> 4 15 69.29 8 Thu AM HIIT no
#> 5 13 73.22 4 Tue AM Cycling no
#> 6 33 81.55 8 Thu AM Cycling yes
#> xgboost_model_pred xgboost_model_prob xgboost_class_custom log_reg_model_pred
#> 1 yes 0.5407097 no no
#> 2 no 0.4563613 no no
#> 3 no 0.3761200 no no
#> 4 no 0.4794471 no yes
#> 5 no 0.4794471 no no
#> 6 yes 0.6074300 no yes
#> log_reg_model_prob log_reg_class_custom
#> 1 0.2158120 no
#> 2 0.3621105 no
#> 3 0.1448192 no
#> 4 0.5957813 no
#> 5 0.4900168 no
#> 6          0.9115647                  yes

Evaluating the models with the ConfusionTableR package
The default caret confusionMatrix function saves
everything as a string and doesn't allow you to work with the values
from the output.
This is the problem the ConfusionTableR package solves: it stores the confusion matrix values in a flat, record-level format you can work with as and when needed.
Evaluate Logistic Regression baseline
First, I will evaluate my baseline model using the package:
cm_lr <- ConfusionTableR::binary_class_cm(
train_labels = as.factor(predictions$log_reg_class_custom),
truth_labels = as.factor(predictions$attended),
positive='yes', mode='everything'
)
cm_lr$confusion_matrix
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction no yes
#> no 240 59
#> yes 19 57
#>
#> Accuracy : 0.792
#> 95% CI : (0.7474, 0.832)
#> No Information Rate : 0.6907
#> P-Value [Acc > NIR] : 0.00000724
#>
#> Kappa : 0.462
#>
#> Mcnemar's Test P-Value : 0.00001006
#>
#> Sensitivity : 0.4914
#> Specificity : 0.9266
#> Pos Pred Value : 0.7500
#> Neg Pred Value : 0.8027
#> Precision : 0.7500
#> Recall : 0.4914
#> F1 : 0.5937
#> Prevalence : 0.3093
#> Detection Rate : 0.1520
#> Detection Prevalence : 0.2027
#> Balanced Accuracy : 0.7090
#>
#> 'Positive' Class : yes
#>

The baseline model performs pretty well. You can see this is the result of fixing our imbalance. Let's work with it in a row-wise fashion, as we can extract some metrics we may be interested in:
# Get record level confusion matrix for logistic regression model
cm_rl_log_reg <- cm_lr$record_level_cm
accuracy_frame <- tibble(
Accuracy=cm_rl_log_reg$Accuracy,
Kappa=cm_rl_log_reg$Kappa,
Precision=cm_rl_log_reg$Precision,
Recall=cm_rl_log_reg$Recall
)

Evaluate XGBoost Model
The next stage is to evaluate the XGBoost baseline. We will use this
final evaluation to compare with our baseline model.
Note: in reality this would be compared across many models:
cm_xgb <- ConfusionTableR::binary_class_cm(
#Here you will have to cast to factor type as the tool expects factors
train_labels = as.factor(predictions$xgboost_class_custom),
truth_labels = as.factor(predictions$attended),
positive='yes', mode='everything'
)
# View the confusion matrix native
cm_xgb$confusion_matrix
#> Confusion Matrix and Statistics
#>
#> Reference
#> Prediction no yes
#> no 259 116
#> yes 0 0
#>
#> Accuracy : 0.6907
#> 95% CI : (0.6412, 0.7371)
#> No Information Rate : 0.6907
#> P-Value [Acc > NIR] : 0.5251
#>
#> Kappa : 0
#>
#> Mcnemar's Test P-Value : <0.0000000000000002
#>
#> Sensitivity : 0.0000
#> Specificity : 1.0000
#> Pos Pred Value : NaN
#> Neg Pred Value : 0.6907
#> Precision : NA
#> Recall : 0.0000
#> F1 : NA
#> Prevalence : 0.3093
#> Detection Rate : 0.0000
#> Detection Prevalence : 0.0000
#> Balanced Accuracy : 0.5000
#>
#> 'Positive' Class : yes
#>

- now extract the record-level metrics and bind them onto the previous frame to compare the two models:
# Get record level confusion matrix for the XGBoost model
cm_rl_xgboost <- cm_xgb$record_level_cm
accuracy_frame_xg <- tibble(
Accuracy=cm_rl_xgboost$Accuracy,
Kappa=cm_rl_xgboost$Kappa,
Precision=cm_rl_xgboost$Precision,
Recall=cm_rl_xgboost$Recall
)
# Bind the rows from the previous frame
accuracy_frame <- rbind(accuracy_frame, accuracy_frame_xg)
accuracy_frame
#> # A tibble: 2 x 4
#> Accuracy Kappa Precision Recall
#> <dbl> <dbl> <dbl> <dbl>
#> 1 0.792 0.462 0.75 0.491
#> 2    0.691     0        NA      0

Conclusion
- this article served to show some ways to visualise data, as well as how to fit logistic regression and XGBoost models using tidymodels
- the logistic regression model performed considerably better than XGBoost; at the 0.7 threshold, the final XGBoost fit predicted no positive cases at all
References
Iannone R, Cheng J, Schloerke B, Hughes E, Seo J (2022). gt: Easily Create Presentation-Ready Display Tables. R package version 0.8.0, https://CRAN.R-project.org/package=gt.
Kuhn et al., (2020). Tidymodels: a collection of packages for modeling and machine learning using tidyverse principles. https://www.tidymodels.org
Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R, Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). “Welcome to the tidyverse.” Journal of Open Source Software, 4(43), 1686. doi:10.21105/joss.01686
Further reading
H. Wickham and G. Grolemund, R for Data Science: Visualize, Model, Transform, Tidy, and Import Data.
H. Wickham, Advanced R